Statistical analysis of a multi-objective solution set from Quantum Particle Swarm Optimization model : application to China datas

I - Introduction

In [1]:
from IPython.display import Latex

I.1 - Context

The interpretation of U-Th data relies in this case on a time evolution law of the U-series nuclides, which is the classical law used to account for the variations of the radioactive disequilibria during weathering, namely a loss and gain model , . In the frame of such a scenario the temporal evolution of the number of the atoms of the five 238U-234U-230Th-232Th-226Ra nuclides per gram of sample is described by the four following equations:

\begin{eqnarray}\label{eqn:model} \hspace{-1.5cm} \left\{ \begin{array}{lll} \dfrac{\partial {}^{238}\mbox{U}}{\partial t}+(k_{238}+\lambda_{238})\,{}^{238}\mbox{U}&=&f_{238}\,{}^{238}\mbox{U}_0,\\ \dfrac{\partial{}^{234}\mbox{U}}{\partial t}+(k_{234}+\lambda_{234})\,{}^{234}\mbox{U}&=&f_{234}\,{}^{234}\mbox{U}_0+\lambda_{238}\,{}^{238}\mbox{U},\\ \dfrac{\partial{}^{230}\mbox{Th}}{\partial t}+(k_{230}+\lambda_{230})\,{}^{230}\mbox{Th}&=&f_{230}\,{}^{230}\mbox{Th}_0+\lambda_{234}\,{}^{234}\mbox{U},\\ \dfrac{\partial{}^{232}\mbox{Th}}{\partial t}+(k_{232}+\lambda_{232})\,{}^{232}\mbox{Th}&=&f_{232}\,{}^{232}\mbox{Th}_0,\\ \dfrac{\partial{}^{226}\mbox{Ra}}{\partial t}+(k_{226}+\lambda_{226})\,{}^{226}\mbox{Ra}&=&f_{226}\,{}^{226}\mbox{Ra}_0+\lambda_{230}\,{}^{230}\mbox{Th},\\ \end{array} \right. \end{eqnarray}

where $\lambda_i$ are the decay constants (in $y^{-1}$) of nuclides $i$ (here 238U, 234U, 230Th, 232Th and 226Ra), $k_i$ are the first-order rate constants (in $y^{-1}$) for leaching of nuclides $i$ and $f_i$ are the input fluxes (in $y^{-1}$) of nuclides $i$ gained by the regolith.

For simplification, the input fluxes are expressed as the proportion of the number of atoms of nuclides added per year to the initial sample (expressed as ${\lambda}_i N_0$).

Solving the above equation system allows to calculate the theoretical activities of the various nuclides in the sample at time t and the four independent activity ratios (234U/238U), (230Th/238U), (230Th/232Th) and (226Ra/???).

The nuclides activities and the activity ratios are a function of the mobility parameters ($k_i$, $f_i$) of the model, the initial activities of the different nuclides and the time $t$. The time $t$ corresponds to the time span between the initial state of the sample at $t=0$ and its present state.

Thus, the time $t$ corresponds to the time duration elapsed for a sample to move from the starting position to its current position. Usually the reference sample is collected at a locality upstream in the plain, which corresponds to the starting or reference position. It is important to note that such an approach implies that the transfer of sediments within the plain works at steady state.

The mobility parameters ($k_i$, $f_i$) are usually unknown; therefore, the analysis of the radionuclides and the related radioactive disequilibria in two samples are not sufficient to calculate the time $t$. To solve this problem a series of samples has to be collected in soil at different depths. Due to the complexity of the complete China sample, the complete set has been boxed into four sub-samples in wich the mobility parameters are considered constant.

The determined radioactive disequilibria of these samples are then used to determine (1) the mobility parameters of the different radionuclides and (2) the age of the different samples relative to the reference sample; this age corresponds to a production rate $p$ and is related to the time span necessary to move from the reference position to its current position.

I.2 - Statistical framework

The same statistical framework has been conducted for each profil and each parameter. For fitting and computing probability density functions (PDF), we used Python packages scipy.stats, seaborn for statistical analysis, pandas to store/access large data frames and plotly for interactive graphic rendering. A hundred of bins was used to perform initial distributions.

Afterwards, a kernel density estimation (KDE) is applied to provide an estimate of underlying distributions. The KDE algorithm takes a parameter, bandwidth, that affects how "smooth" the resulting curve is. We use the same bandwidth $n^{-{1}/{5}}=0.0725$ where $n$ is the size of the population ($n \approx 500~000$).

Thus, obtained KDE curve has been fitted on a normal distribution (normalization) in order to compute mean $\mu$ and standard deviation $\sigma$.

I.3 - Kernel density estimation on soil profiles distribution

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. (source wikipedia)

Remarque : Un lien intéressant (en tout cas pour moi !) pour mieux comprendre sa construction How to build KDE function

II - Bibliography

III - Material and method

In [12]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import plotly.figure_factory as ff
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

III.1 - Datas and parameters set for analysis

In [2]:
labels=['Profil 2 ','Profil 3a','Profil 4a','Profil 4b']
var = ['Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)']
colors = ['blue','red','green','magenta']

Using pandas package to read .csv output results files:

In [3]:
# Load using Pandas module csv file
prof_2 = pd.read_csv('zone_2.csv',names=['ID_BEE','Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)','k238/f238','k234/f234'],encoding='utf-8',delimiter="\t",dtype=np.float64)
prof_3 = pd.read_csv('zone_3a.csv',names=['ID_BEE','Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)','k238/f238','k234/f234'],encoding='utf-8',delimiter="\t",dtype=np.float64)
prof_4 = pd.read_csv('zone_4a.csv',names=['ID_BEE','Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)','k238/f238','k234/f234'],encoding='utf-8',delimiter="\t",dtype=np.float64)
prof_5 = pd.read_csv('zone_4b.csv',names=['Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)','k238/f238','k234/f234'],encoding='utf-8',delimiter="\t",dtype=np.float64)

Store data into pandas.DataFrame container and read first 5 lines of each table:

In [4]:
# Store data in data frame object from Pandas module
df_p2 = pd.DataFrame(prof_2,columns=['Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)'])
df_p3 = pd.DataFrame(prof_3,columns=['Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)'])
df_p4 = pd.DataFrame(prof_4,columns=['Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)'])
df_p5 = pd.DataFrame(prof_5,columns=['Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)'])
In [5]:
# Print first 5 results
print("Profil 2 :",prof_2.head())
print(30*'---')
print("Profil 3a :",prof_3.head())
print(30*'---')
print("Profil 4a :",prof_4.head())
print(30*'---')
print("Profil 4b :",prof_5.head())
Profil 2 :    ID_BEE     Fobj      Time         P          k238      f238      k234  \
0     2.0  2.60108  530734.0  3.768366  8.525980e-07  0.000002  0.000002   
1     3.0  1.36673  699205.0  2.860391  8.053690e-07  0.000002  0.000002   
2     4.0  1.72621  633165.0  3.158734  6.251360e-07  0.000001  0.000002   
3     5.0  2.07166  645396.0  3.098873  8.461530e-07  0.000002  0.000002   
4    11.0  1.28323  745205.0  2.683825  8.395180e-07  0.000002  0.000002   

       f234  k(234/238)  f(234/238)  k238/f238  k234/f234  
0  0.000003     2.59294     2.00407   0.496320   0.642156  
1  0.000003     1.88402     1.57211   0.456156   0.546658  
2  0.000003     2.66639     1.95487   0.430792   0.587591  
3  0.000003     2.34165     1.85392   0.483514   0.610718  
4  0.000003     2.14020     1.76854   0.504751   0.610826  
------------------------------------------------------------------------------------------
Profil 3a :    ID_BEE     Fobj      Time         P          k238      f238          k234  \
0     2.0  2.22569  263302.0  3.038336  5.847340e-07  0.000002  9.274450e-07   
1    10.0  2.98858  284930.0  2.807707  6.083430e-07  0.000002  1.052830e-06   
2    11.0  2.81116  275668.0  2.902042  5.300900e-07  0.000002  9.465310e-07   
3    15.0  1.55713  280458.0  2.852477  5.302180e-07  0.000002  7.563530e-07   
4    18.0  2.53883  263006.0  3.041756  6.071970e-07  0.000002  9.285730e-07   

       f234  k(234/238)  f(234/238)  k238/f238  k234/f234  
0  0.000002     1.58610     1.25092   0.295459   0.374624  
1  0.000002     1.73066     1.26298   0.308851   0.423219  
2  0.000002     1.78560     1.30330   0.276790   0.379219  
3  0.000002     1.42649     1.16446   0.307052   0.376146  
4  0.000002     1.52928     1.18420   0.320471   0.413859  
------------------------------------------------------------------------------------------
Profil 4a :    ID_BEE     Fobj      Time         P      k238          f238      k234  \
0     1.0  2.69455  600748.0  1.831051  0.000001  3.569110e-07  0.000002   
1     2.0  2.29191  621668.0  1.769433  0.000002  7.613710e-07  0.000003   
2     3.0  2.83939  603123.0  1.823840  0.000002  5.634350e-07  0.000002   
3     5.0  1.49193  643055.0  1.710585  0.000002  7.435170e-07  0.000003   
4     6.0  2.61366  594794.0  1.849380  0.000001  3.383130e-07  0.000002   

           f234  k(234/238)  f(234/238)  k238/f238  k234/f234  
0  6.522460e-07     1.36102     1.82748    3.55590    2.64826  
1  1.486580e-06     1.47537     1.95251    2.96499    2.24044  
2  7.432420e-07     1.15864     1.31913    3.07930    2.70467  
3  1.294850e-06     1.40869     1.74152    2.74026    2.21655  
4  7.950570e-07     1.49618     2.35006    4.05848    2.58384  
------------------------------------------------------------------------------------------
Profil 4b :        Fobj      Time         P      k238          f238      k234  \
0  2.843840  648864.0  3.082310  0.000002  1.054370e-06  0.000002   
1  2.231550  634212.0  3.153520  0.000001  5.326190e-07  0.000002   
2  0.430188  666818.0  2.999319  0.000002  9.547530e-07  0.000003   
3  0.661399  638529.0  3.132199  0.000002  9.483540e-07  0.000003   
4  1.122360  642197.0  3.114309  0.000002  9.604270e-07  0.000003   

           f234  k(234/238)  f(234/238)  k238/f238  k234/f234  
0  6.596870e-07    0.897576    0.625669    1.82428    2.61708  
1  8.498740e-07    1.636620    1.595650    1.99967    2.05100  
2  1.416360e-06    1.487470    1.483490    1.76741    1.77216  
3  1.463840e-06    1.509180    1.543550    1.81137    1.77103  
4  1.583830e-06    1.598310    1.649090    1.77160    1.71705  

III.2 - Statistical analysis

code python associé :

In [26]:
from IPython.display import display,clear_output
from ipywidgets import interact, IntSlider, Dropdown

val_stat = np.zeros((8,5,np.size(stats.t.fit(df_p2[var[1]]))))

#parameter_slider = IntSlider(min=0, max=4, step=1, value=0)
parameter_drop = Dropdown(options=[('Time y', 0), ('P m/My', 1), ('k238', 2), ('f238', 3), ('k234', 4), ('f234', 5),('k234/k238',6),('f234/f238',7)])
profile_drop   = Dropdown(options=[('Profile 2', 0), ('Profile 3a', 1), ('Profile 3b', 2), ('Profile 4a', 3)])
# decorate the plot function with an environment from the UIs:
@interact(parameter=parameter_drop, profile=profile_drop)
def plot(parameter, profile):
    fig, ax = plt.subplots(figsize=(15, 2))
    xlim = [1.0E+6,4.0,1.0E-5,1.0E-5,1.0E-5,1.0E-5,4.,4.]
    bins = 1 ; mean = 0. ; sigma = 1.
    if(profile==0):
        n, bins, patches = plt.hist(df_p2[var[parameter]], 400, density=1,color=colors[profile])
        val_stat[parameter,profile,:] = stats.t.fit(df_p2[var[parameter]])
        mean  = val_stat[parameter,profile,1]
        sigma = val_stat[parameter,profile,2]    
        if(parameter==0):
            val_stat[parameter,profile,1] =  np.mean(df_p2[var[parameter]])
            val_stat[parameter,profile,2] =  np.std(df_p2[var[parameter]]) 
    if(profile==1):
        n, bins, patches = plt.hist(df_p3[var[parameter]], 400, density=1,color=colors[profile])
        val_stat[parameter,profile,:] = stats.t.fit(df_p3[var[parameter]])
        #if(parameter==0):
            #val_stat[parameter,profile,1] =  np.mean(df_p3[var[parameter]])
            #val_stat[parameter,profile,2] =  np.std(df_p3[var[parameter]]) 
    if(profile==2):
        n, bins, patches = plt.hist(df_p4[var[parameter]], 400, density=1,color=colors[profile])
        val_stat[parameter,profile,:] = stats.t.fit(df_p4[var[parameter]])
        if(parameter==0):
            val_stat[parameter,profile,1] =  np.mean(df_p4[var[parameter]])
            val_stat[parameter,profile,2] =  np.std(df_p4[var[parameter]]) 
    if(profile==3):
        n, bins, patches = plt.hist(df_p5[var[parameter]], 400, density=1,color=colors[profile]) 
        val_stat[parameter,profile,:] = stats.t.fit(df_p5[var[parameter]])
        if(parameter==0):
            val_stat[parameter,profile,1] =  np.mean(df_p5[var[parameter]])
            val_stat[parameter,profile,2] =  np.std(df_p5[var[parameter]]) 
    mean  = val_stat[parameter,profile,1]
    sigma = val_stat[parameter,profile,2]                                      
    ax.plot(bins, stats.norm.pdf(bins, mean, sigma))
    ax.axvline(x=mean,c=colors[profile],ls='dashed')
    ax.axvspan(mean-sigma, mean+sigma,alpha=0.3,facecolor=colors[profile])
    ax.set_xlabel('Parameter : %s'%var[parameter])
    ax.text(np.max(bins)/1.5,np.max(n)/2., r'$\mu$ = %e'%mean)
    ax.text(np.max(bins)/1.5,np.max(n)/3., r'$\pm\sigma$= %e'%sigma)
    ax.set_xlim(0.0,np.max(bins))
    ax.set_ylabel('Probability')
    ax.set_title('Profile : %s '%labels[profile])
    print('%s'%labels[profile],'==> mean = %e'%mean,'| standard deviation ','= %e'%sigma)

plt.show()

IV - Mean $\mu$ and standard deviation $\sigma$ determination

IV.1 - Results for file zone_2.csv :

Parameter Time (y) $p$ (m/My) $k_{238}$ (1/s) $f_{238}$ (1/s) $k_{234}$ (1/s) $f_{234}$ (1/s) $\dfrac{k_{238}}{k_{234}}$ $\dfrac{f_{238}}{f_{234}}$
Mean $\mu$ $891~224$ $2.207958$ $5.373379~10^{-7}$ $17.44775~10^{-7}$ $9.595960~10^{-7}$ $24.93313~10^{-7}$ 1.977343 1.435569
Std deviation $\sigma~\pm$ $99~528$ $0.2095690$ $1.942053~10^{-7}$ $2.301647~10^{-7}$ $2.937391~10^{-7}$ $3.146862~10^{-7}$ 0.3963533 0.1365413

IV.2 - Results for file zone_3a.csv :

Parameter Time (y) $p$ (m/My) $k_{238}$ (1/s) $f_{238}$ (1/s) $k_{234}$ (1/s) $f_{234}$ (1/s) $\dfrac{k_{238}}{k_{234}}$ $\dfrac{f_{238}}{f_{234}}$
Mean $\mu$ $303~840$ $2.664715$ $6.794840~10^{-7}$ $20.56547~10^{-7}$ $9.137099~10^{-7}$ $23.59388~10^{-7}$ 1.353190 1.150026
Std deviation $\sigma~\pm$ $39~600$ $0.2348235$ $1.704398~10^{-7}$ $2.243455~10^{-7}$ $2.345804~10^{-7}$ $3.095209~10^{-7}$ 0.2558826 0.1128398

IV.3 - Results for file zone_3b.csv :

Parameter Time (y) $p$ (m/My) $k_{238}$ (1/s) $f_{238}$ (1/s) $k_{234}$ (1/s) $f_{234}$ (1/s) $\dfrac{k_{238}}{k_{234}}$ $\dfrac{f_{238}}{f_{234}}$
Mean $\mu$ $590~150$ $1.846073$ $24.17446~10^{-7}$ $9.921260~10^{-7}$ $30.49398~10^{-7}$ $14.29396~10^{-7}$ 1.287298 1.511653
Std deviation $\sigma~\pm$ $21~629$ $0.5859635$ $11.41665~10^{-7}$ $7.806470~10^{-7}$ $16.33527~10^{-7}$ $11.73991~10^{-7}$ 0.1414916 0.2558008

IV.4 - Results for file zone_4a.csv :

Parameter Time (y) $p$ (m/My) $k_{238}$ (1/s) $f_{238}$ (1/s) $k_{234}$ (1/s) $f_{234}$ (1/s) $\dfrac{k_{238}}{k_{234}}$ $\dfrac{f_{238}}{f_{234}}$
Mean $\mu$ $791~312$ $2.334735$ $12.43773~10^{-7}$ $5.232120~10^{-7}$ $17.86370~10^{-7}$ $7.210024~10^{-7}$ 1.516394 1.610835
Std deviation $\sigma~\pm$ $158~785$ $0.2347057$ $2.733054~10^{-7}$ $1.994557~10^{-7}$ $3.112318~10^{-7}$ $2.416315~10^{-7}$ 0.2982813 0.5624723

V - Parameters distribution by profile

V.1 - Production rate $p$ :

In [67]:
# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([df_p2['P'],df_p3['P'],df_p4['P'],df_p5['P']],labels, bin_size=0.01,
                         curve_type='kde', # override default 'kde'
                         colors=colors,show_rug=False,show_hist=True)
for l in range(0,4):
    print("Mean : %f"%val_stat[1,l,1]," $+/- std dev : %f"%val_stat[1,l,2])
    fig.add_trace(
    go.Scatter(
        x=[val_stat[1,l,1], val_stat[1,l,1]],
        y=[0., 8],
        name="Mean : %s "%labels[l],
        mode="lines",
        line=go.scatter.Line(color=colors[l],dash="dot"),
        showlegend=True)
    )

    # Add Std deviation
    fig.update_layout(
        shapes=[
            # filled Rect Std dev prof 1
            go.layout.Shape(type="rect",
                x0=val_stat[1,0,1]-val_stat[1,0,2],
                y0=0,
                x1=val_stat[1,0,1]+val_stat[1,0,2],
                y1=8,
                name="Std dev : %s "%labels[0],line=dict(color=colors[0],width=0),fillcolor=colors[0],
                opacity=0.2,layer="below"),
            # filled Rect Std dev prof 2
            go.layout.Shape(type="rect",
                x0=val_stat[1,1,1]-val_stat[1,1,2],
                y0=0,
                x1=val_stat[1,1,1]+val_stat[1,1,2],
                y1=8,
                name="Std dev : %s "%labels[1],line=dict(color=colors[1],width=0),fillcolor=colors[1],
                opacity=0.2,layer="below"),
            # filled Rect Std dev prof 3
            go.layout.Shape(type="rect",
                x0=val_stat[1,2,1]-val_stat[1,2,2],
                y0=0,
                x1=val_stat[1,2,1]+val_stat[1,2,2],
                y1=8,
                name="Std dev : %s "%labels[2],line=dict(color=colors[2],width=0),fillcolor=colors[2],
                opacity=0.2,layer="below"),
            # filled Rect Std dev prof 4
            go.layout.Shape(type="rect",
                x0=val_stat[1,3,1]-val_stat[1,3,2],
                y0=0,
                x1=val_stat[1,3,1]+val_stat[1,3,2],
                y1=8,
                name="Std dev : %s "%labels[3],line=dict(color=colors[3],width=0),fillcolor=colors[3],
                opacity=0.2,layer="below"),
            
        ]
    )

# Add title
fig.update_layout(title_text='KDE distribution for Production rate')
fig.show()
Mean : 2.207958  $+/- std dev : 0.209569
Mean : 2.664715  $+/- std dev : 0.234824
Mean : 1.846073  $+/- std dev : 0.585963
Mean : 2.334735  $+/- std dev : 0.234706

V.2 - Final time $t_{final}$ :

In [8]:
# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([df_p2['Time'],df_p3['Time'],df_p4['Time'],df_p5['Time']],labels, bin_size=10000.,
                         curve_type='kde', # override default 'kde'
                         colors=colors,show_rug=False,show_hist=True)
# Add title
fig.update_layout(title_text='KDE distribution for Final time')
fig.show()

VI - Parameters correlation

In [9]:
import pandas as pd
from plotly import plotly
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode()
import numpy as np
prof= pd.read_csv('zone_2.csv',names=['ID_BEE','Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)','k238/f238','k234/f234'],encoding='utf-8',delimiter="\t",dtype=np.float64)
df= pd.DataFrame(prof,columns=['Fobj','Time','P','k238','f238','k234','f234','k(234/238)','f(234/238)'])

fig = go.Figure(data=[go.Scatter3d(x=df['P'], y=df['k238'], z=df['f238'],mode='markers',marker=dict(
    size=1,
    color=df['Fobj'],                # set color to an array/list of desired values
    colorscale='Jet',   # choose a colorscale
    colorbar=dict(
        title="Fobj"
    ),
    opacity=0.8))])
iplot(fig, filename = 'graph_3.html')
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-9-04329e4a965c> in <module>
      1 import pandas as pd
----> 2 from plotly import plotly
      3 from plotly import tools
      4 import plotly.graph_objs as go
      5 from plotly.offline import init_notebook_mode, plot, iplot

~/.local/lib/python3.6/site-packages/plotly/plotly/__init__.py in <module>
      2 from _plotly_future_ import _chart_studio_error
      3 
----> 4 _chart_studio_error("plotly")

~/.local/lib/python3.6/site-packages/_plotly_future_/__init__.py in _chart_studio_error(submodule)
     47 chart_studio.{submodule} module instead.
     48 """.format(
---> 49             submodule=submodule
     50         )
     51     )

ImportError: 
The plotly.plotly module is deprecated,
please install the chart-studio package and use the
chart_studio.plotly module instead. 
In [8]:
import plotly.express as px
fig = px.scatter_3d(df_p2,
                    x='P',
                    y='k238',
                    z='f238',
                    color='Fobj',
                    size='Fobj',
                    size_max=10,
                    color_continuous_scale=px.colors.sequential.Jet)
fig.show()
In [9]:
fig = px.scatter_matrix(df_p2,
                        dimensions=['Time','P','k238','f238','Fobj','Fobj'],
                        color='Fobj',
                        size='Fobj',
                        size_max=2.5,opacity=0.5,
                        color_continuous_scale=px.colors.sequential.Jet)
fig.show()